The American Journal of Human Genetics
Elsevier BV
Preprints posted in the last 30 days, ranked by how well they match The American Journal of Human Genetics's content profile, based on 206 papers previously published here. The average preprint has a 0.20% match score for this journal, so anything above that is already an above-average fit.
Lin, J.-R.; Zhang, Z.
Mosaic loss of chromosome Y (LOY) is a common age-associated somatic alteration in men and is typically measured from DNA-based assays. Many cohorts, however, contain bulk RNA-seq data without matched DNA-based LOY measurements. We developed a Bayesian framework to estimate the fraction of cells with LOY from male bulk RNA-seq by modeling reduced Y-linked gene expression relative to expected expression after adjustment for age, expression covariates, and autosomal/X-linked control genes. In 377 male GTEx samples, individual Y-linked genes showed negative correlations with separately obtained DNA-based LOY measurements, supporting a shared Y-expression depletion signal. The primary fast empirical Bayes estimator achieved a Pearson correlation of 0.678 with measured LOY, a mean absolute error of 1.79%, a root mean squared error of 3.72%, and 95.2% empirical coverage of measured LOY. Performance was strongest for identifying large LOY events, with an AUC of 0.964 for measured LOY greater than 20%, while fine ranking among low-LOY samples remained uncertain. A mixture/PCA hierarchical Bayesian sensitivity model provided similar validation performance and interpretable posterior quantities but did not improve point estimation. Leave-one-Y-gene-out and prior-sensitivity analyses showed that the signal was distributed across multiple Y-linked transcripts and that prior shrinkage affected calibration. In an external whole-blood RNA-seq dataset without measured LOY, estimated LOY showed a modest age-related increase, but ex vivo immune stimulation shifted RNA-derived LOY estimates and reduced multiple Y-linked transcripts, indicating transcriptional confounding. These results show that bulk RNA-seq contains usable information about LOY, especially for larger events, but RNA-derived LOY should be interpreted as a probabilistic transcriptome-based estimate rather than a direct substitute for DNA-based mosaicism measurement.
Zhang, L.; Paterson, A. D.; Sun, L.
Testing for Hardy-Weinberg equilibrium (HWE) is a fundamental component of genetic data analysis, widely used for quality control and model validation. Although HWE testing is well established for autosomal loci, inference on the X chromosome is more complex due to sex-specific genotype structures and potential sex differences in minor allele frequency (sdMAF). Existing tests differ in their assumptions about sdMAF and male sample inclusion, often leading to distinct but poorly characterized null hypotheses. We develop a general statistical framework for HWE inference using the robust allele-based regression model. By formulating HWE testing as an assessment of allele-level dependence, the framework directly parameterizes Hardy-Weinberg disequilibrium, unifies existing Pearson χ²-based tests under explicit modeling assumptions, and clarifies their null hypotheses, degrees of freedom, and sensitivity to sdMAF. The framework also accommodates covariate and population-structure adjustment within a unified regression-based formulation. The proposed framework provides robust, interpretable, and flexible inference, establishing a unified statistical foundation for HWE testing across autosomal and X-chromosomal regions. Simulation studies and analysis of high-coverage 1000 Genomes Project data demonstrate that commonly used X-chromosome tests can exhibit inflated type I error or misleading inference when sdMAF is present.
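For orientation, the classical Pearson χ² test of HWE at a single autosomal biallelic locus can be sketched as follows. This is the textbook 1-degree-of-freedom test that the abstract takes as its starting point, not the authors' allele-based regression framework, and it does not handle the X chromosome or sdMAF. The function name and argument names are illustrative assumptions.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Textbook Pearson chi-square test of HWE from genotype counts.

    Returns (statistic, p_value) using a chi-square reference
    distribution with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    # Allele frequency of A estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("locus is monomorphic; the test is undefined")
    # Genotype counts expected under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 df: P(Z^2 > stat).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

Calling `hwe_chi_square(25, 50, 25)` describes a sample in exact Hardy-Weinberg proportions, so the statistic is zero; a sample with no heterozygotes, such as `hwe_chi_square(50, 0, 50)`, gives a large statistic and a very small p-value.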
Manirakiza, A. V.; Baichoo, S.; Uwineza, A.; Dukundane, D.; Rugengamanzi, E.; Mutamuliza, J.; Niragira, A.; Muvunyi, R.; Besada, J.; Nielsen, S.; Bucknor, B.; Koeller, D. R.; Andrews, C.; Mutesa, L.; Fadelu, T.; Rebbeck, T. R.
Germline data from African populations remain sparse, limiting characterization of population-specific BRCA1/2 pathogenic variants. In a study of 175 Rwandan women with breast cancer, 7 unrelated carriers (4% of cases; 22% of pathogenic variant carriers) harbored the same BRCA1 frameshift variant, c.4065_4068del (p.Asn1355Lysfs*10), which is extremely rare in gnomAD yet recurrent in European, Asian, and Middle Eastern cohorts. Whole-exome sequencing and haplotype analysis of all 7 carriers revealed a shared ancestral block of approximately 581 kb surrounding the variant, and extended haplotype homozygosity and network analyses confirmed a common founder origin. Coalescent-based age estimation placed the founder event approximately 4,000-10,000 years ago. Comparison with 1000 Genomes Project data showed the founder haplotype is absent or exceedingly rare outside African and South Asian populations. These findings strongly suggest that c.4065_4068del is a prehistoric BRCA1 founder variant in Rwanda, with implications for targeted genetic testing, cascade screening, and cancer prevention in the region.
Chang, X.; Hou, S.; Zhou, X.
Calibrated prediction intervals for polygenic scores (PGS) are essential for communicating individual-level uncertainty in genomic medicine. We present updated comparisons of two methods for constructing such intervals: CalPred, a parametric approach, and PredInterval, a non-parametric approach. Our results show that both methods can achieve calibrated coverage, although CalPred additionally requires a sufficiently large calibration set. The two methods also exhibit complementary trade-offs with respect to dataset size and risk identification. We further show that contextual calibration, as introduced in Hou et al. and followed in Shi et al., is most naturally achieved through appropriate phenotype normalization and data preprocessing. Apparent miscalibration can arise from inadequate normalization or from providing contextual information to some methods but not others. In UK Biobank, standard GWAS phenotype normalization procedures are sufficient to achieve contextual calibration for traits analyzed. In the extreme simulations of Hou et al. and Shi et al., supplying contextual covariates to PredInterval restores contextual calibration without normalization, and appropriate normalization can achieve contextual calibration without supplying covariates, while also substantially improving upstream tasks including association power and PGS accuracy. Together, these results underscore the central role of phenotype normalization and data preprocessing in GWAS analyses, including reliable uncertainty quantification for PGS.
Motegi, T.; Huang, F.; Campbell, J. D.
Local ancestry inference (LAI) enables high-resolution characterization of chromosomal segments inherited from distinct ancestral populations, offering unique insights into genetic architecture in admixed cohorts. While LAI is commonly performed with high-coverage whole-genome sequencing (WGS), the performance of other genotyping assays and of lower sequencing depths has not been thoroughly benchmarked. In this study, we systematically evaluated the accuracy of LAI across SNP microarrays, whole-exome sequencing (WES), and ultra low-pass WGS (ULP-WGS) using diverse validation samples and state-of-the-art imputation pipelines. We show that ULP-WGS, when paired with GLIMPSE2, achieves robust accuracy at 0.25x coverage with a minimum genome window size of 0.5 centimorgans, with mean accuracy minus one standard deviation exceeding 95%. For WES, using "on-target" reads alone yields suboptimal performance, particularly for European and South Asian ancestries with accuracy less than 79.1% and 70.6%, respectively. However, incorporating "off-target" reads in WES and utilizing GLIMPSE2 substantially improved accuracy to ≥95% with a minimum window size of 0.2 centimorgans. We further evaluated formalin-fixed, paraffin-embedded (FFPE) samples and found that LAI could be performed successfully using WES data with accuracies of ≥95% at a minimum window size of 0.5 centimorgans. In contrast, SNP microarrays did not achieve substantial accuracies at any window size (≤95%). Together, these results demonstrate that LAI is achievable without conventional high-coverage WGS and establish optimal parameters for LAI across platforms.
Cai, L.; DeBerardinis, R. J.
Heterozygous carriers of autosomal recessive disease variants are conventionally considered unaffected, yet population-scale genomic datasets reveal subclinical carrier phenotypes. MMACHC encodes a cobalamin-processing protein whose biallelic loss causes cobalamin C deficiency, an inborn error of intracellular cobalamin metabolism. We performed an unbiased quantitative phenome-wide association screen in All of Us Research Program v8 to identify phenotypes associated with rare heterozygous MMACHC burden variants. Serum/plasma vitamin B12 was the top quantitative association. Carriers had higher circulating B12 than non-carriers in adjusted analyses, but also higher homocysteine, suggesting that elevated circulating B12 does not reflect improved intracellular cobalamin function. Carriers were less likely to fall below conventional B12 insufficiency thresholds, indicating a potential diagnostic blind spot. A pathway-wide rare-variant gene-burden (All-by-All) analysis placed this finding in broader biological context. Burdens in genes related to circulating B12 binding or intestinal absorption were associated with lower circulating B12. In contrast, burdens in several genes involved in cellular delivery and intracellular cobalamin handling were associated with higher circulating B12. This step-specific directionality supports a model in which elevated circulating B12 can reflect impaired cellular handling and consequent systemic accumulation rather than improved cellular cobalamin availability. Because EHR-derived B12 is shaped by heterogeneous clinical and medication contexts, prospective carrier-enriched studies with standardized methylmalonic acid, homocysteine, diet, supplement, medication, comorbidity, and symptom ascertainment are needed to evaluate functional-marker-based screening.
Yerukala Sathipati, S.; Scott, H.
Importance: Hereditary breast and ovarian cancer (HBOC) variant carriers benefit from risk-reducing interventions, but only if identified. The extent to which carriers are clinically recognized, and whether recognition is equitable across diverse populations, is poorly characterized in a single large U.S. cohort. Objective: To estimate P/LP HBOC carrier prevalence across genetic ancestry groups, quantify documented clinical genetic testing among carriers, and evaluate ancestry and socioeconomic disparities in testing. Design, Setting, and Participants: Cross-sectional analysis of the All of Us Research Program Controlled Tier (Curated Data Repository v8/C2024Q3R9), comprising participants with short-read whole genome sequencing and linked electronic health record (EHR) and survey data. Carriers were ascertained from research genomic data independent of clinical testing. Exposures: Genetically inferred ancestry (African [AFR], Admixed American [AMR], East Asian [EAS], European [EUR], Middle Eastern [MID], South Asian [SAS]); self-reported household income and educational attainment. Main Outcomes and Measures: (1) Carrier prevalence with Wilson 95% CIs; (2) documented clinical genetic testing (procedure codes) among carriers; (3) adjusted odds of documented testing among women, by ancestry, before and after socioeconomic adjustment, using multivariable logistic regression. Results: Among 414,830 participants, P/LP HBOC carrier prevalence was 1.42% (95% CI, 1.38-1.45) overall and similar across ancestry groups (AFR 1.24%, AMR 1.32%, EAS 1.19%, EUR 1.52%, MID 1.68%, SAS 1.33%; overlapping CIs). Among 250,071 women in the testing analysis, documented clinical genetic testing was rare: only 74 of 5,878 carriers overall (1.3%) and 59 of 3,572 European-ancestry carriers (1.7%) had a documented test, with counts below reportable thresholds in all other ancestry groups. 
African-ancestry women had lower adjusted odds of documented testing than European-ancestry women (Model 1 adjusted odds ratio [aOR], 0.32; 95% CI, 0.27-0.39), an association that attenuated but persisted after adjustment for income and education (Model 2 aOR, 0.48; 95% CI, 0.40-0.58; P < 0.001); Admixed American women also had reduced adjusted odds (aOR, 0.71; 95% CI, 0.61-0.84). Lower income and lower education were independently and dose-dependently associated with lower testing odds (income <$25,000 aOR, 0.46; high-school education aOR, 0.54). Conclusions and Relevance: High-risk HBOC variant carriers are present across all ancestry groups at similar frequencies, yet documented clinical genetic testing was disparate in the different ancestry groups. African-ancestry women experience a testing gap that is not fully explained by socioeconomic position, implicating structural barriers in access and referral. Population-level strategies that decouple carrier identification from current referral pathways may be required to close this gap.
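The abstract above reports proportions with Wilson 95% confidence intervals. A minimal sketch of the standard Wilson score interval is given below, applied to the 74 of 5,878 carriers with a documented test quoted in the abstract. The function name is an illustrative assumption, and this is the generic textbook formula rather than the authors' analysis code.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion.

    z defaults to the standard normal quantile for a two-sided 95% interval.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    )
    return center - half_width, center + half_width


# Counts taken from the abstract: 74 of 5,878 carriers had a documented test.
low, high = wilson_interval(74, 5878)
print(f"{74 / 5878:.4f} ({low:.4f} to {high:.4f})")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for small proportions such as this one.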
Wolfram, T.; Ahangari, M.; Davidson, I.; Wartschinski, L.; Li, J. H.; Eyre, M.; Stern, D.; Schleede, J.; Haghighi, A.; Carmi, S.; Christensen, M.
Consanguinity is a reproductive union between individuals who share a recent common ancestor. These unions are common in many regions of the world and increase the burden of rare recessive disorders by elevating autozygosity in offspring. Current reproductive genetic screening focuses on a limited set of known pathogenic variants, leaving most recessive risk unaddressed. Here we argue that embryo-level autozygosity, quantified as the fraction of the genome in long runs of homozygosity (FROH), is a potentially actionable genomic biomarker that can be integrated into routine preimplantation genetic testing as a homozygosity-informed embryo-prioritization framework (PGT-H) that can be layered onto existing embryo biopsy workflows when couples are already undergoing IVF with PGT-A or PGT-M. Using forward simulations of first-cousin and double-first-cousin couples, we show that siblings conceived by the same couple span a wide range of FROH; selecting the lowest-FROH candidate from a cohort of five embryos reduces FROH by approximately 40% on average. Combining these reductions with empirical effect-size estimates, we estimate that for first-cousin couples this strategy could reduce risk of intellectual disability by roughly 35-45% (corresponding to an absolute risk reduction of about 1.8-2.2%) and potentially reduce excess recessive disease burden, while also modestly reducing risk of common diseases such as type 2 diabetes. We outline how existing PGT-A and PGT-M workflows could potentially be extended to report embryo-level FROH and discuss ethical and counseling considerations. Autozygosity-based embryo prioritization offers a principled way to address a component of recessive risk that current variant-centric approaches miss.
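As a rough illustration of the quantity involved, FROH can be computed as the summed length of long runs of homozygosity divided by the genome length considered, and a cohort can then be ranked on it. The sketch below assumes ROH segments have already been called and do not overlap; the 1.5 Mb minimum length, the function names, and the embryo labels are hypothetical choices for illustration, not values taken from the abstract.

```python
def froh(roh_segments_bp, genome_length_bp, min_length_bp=1_500_000):
    """Fraction of the genome covered by long runs of homozygosity.

    roh_segments_bp: iterable of (start, end) positions in base pairs,
    assumed non-overlapping. min_length_bp is a hypothetical cutoff for
    what counts as a "long" run.
    """
    if genome_length_bp <= 0:
        raise ValueError("genome length must be positive")
    long_total = sum(
        end - start
        for start, end in roh_segments_bp
        if end - start >= min_length_bp
    )
    return long_total / genome_length_bp


def pick_lowest_froh(froh_by_embryo):
    """Return the label of the embryo with the smallest FROH."""
    return min(froh_by_embryo, key=froh_by_embryo.get)
```

For example, `pick_lowest_froh({"E1": 0.08, "E2": 0.03, "E3": 0.06})` returns `"E2"`. The risk reductions quoted in the abstract come from the authors' forward simulations and effect-size estimates, which this sketch does not reproduce.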
Jaishankar, D.; Gjorgjieva, T.; Jala, J.; Swigert, J.; Young, A. S.; Benjamin, D. J.; Cesarini, D. A.; Turley, P.
We introduce a novel approach, Genomic-Relatedness-Matched Association (GRMA) studies, as an alternative to genome-wide association studies (GWAS). GWAS are typically restricted to samples of mostly unrelated individuals with a single, shared continental ancestry and can nevertheless be biased by gene-environment correlation and assortative mating. In contrast, GRMA can be implemented in ancestrally diverse samples (retaining individuals of mixed or underrepresented ancestries and eliminating the need to assign labels to ancestry groups) and can reduce bias relative to standard GWAS. GRMA matches each individual to a group of controls whose pairwise relatedness with the individual exceeds a user-specified threshold. It generates SNP-level summary statistics based on within-group associations. In applications using the UK Biobank and All of Us data, we find that GRMA compares favorably to GWAS methods in terms of bias, precision, and population coverage. GRMA enables several novel findings; for example, we find that "genetic nurture" is unlikely to be an important source of genome-wide bias in population GWAS of body mass index, height, and educational attainment. The method is computationally efficient and supported by open-source software, facilitating its application in large-scale scientific and health-related studies.
Johansson, P. A.; Brooks, K.; Palmer, J. M.; Nathan, V.; Xu, M.; Scales, J. L.; Hennessey, R.; Holland, E. A.; Harland, M.; Hutchison, S.; Chan, P. Y.; Sankar, A.; Papiernik, S.; Dennis, A.; Thakur, R.; Chari, R.; Schmid, H.; Law, M. H.; Curnow, L.; Howlie, M.; Rodgers, C. B.; Mustard, C.; Bishop, T. D.; Newton-Bishop, J.; Mann, G. J.; Cust, A. E.; Adams, D. J.; Brown, K. M.; Hayward, N. K.; Pritchard, A. L.
Deleterious CDKN2A germline variants account for ~40% of familial melanoma cases, while rare variants in CDK4, BAP1, and telomere-maintenance genes collectively account for ~10% of familial risk. We sought to identify new high-penetrance susceptibility variants by sequencing 305 melanoma cases from 89 multi-case families negative for known predisposition gene variants. In one family, cutaneous melanoma co-segregated with a rare variant in DMRTA1 (p.Glu383Gln), located less than 480 kb upstream of CDKN2A on chromosome 9. Whole-genome sequencing then revealed an intergenic 234 kb deletion that co-segregated with melanoma in 18 out of 21 cases across four generations. Further investigation identified another 10 families carrying this deletion, which co-segregated with melanoma. The deleted region was predicted to encompass regulatory sequences and to interact with the CDKN2A promoter region. Tiled CRISPR inhibition of the predicted enhancer region confirmed an interaction between the distant upstream deleted region and CDKN2A, resulting in decreased p16 mRNA expression. Deletion carriers exhibited near-complete loss of p16 mRNA expression from the affected chromosome. This distant noncoding deletion is one of the most common founder variants predisposing to melanoma and reveals a new mechanism controlling p16 expression. Routine screening for this deletion in individuals with perceived high risk of melanoma is warranted.
Lalli, J. L.; Bortvin, A. N.; McCoy, R. C.; Werling, D. M.
The T2T-CHM13 complete human reference genome contains ~200 Mb of previously unresolved sequence, improving read mapping and variant calling compared to GRCh38. However, the benefits of using complete reference genomes for phasing and imputation are unclear. Here, we present a reference T2T-CHM13 recombination map and phased haplotype panel derived from 3,202 samples from the 1000 Genomes Project (1kGP). Using published long-read based assemblies as a reference-neutral ground truth, we compared our T2T-CHM13 1kGP panel to the previously released GRCh38 1kGP panel. We found that alignment to T2T-CHM13 resulted in 38% fewer assembly-discordant SNP genotypes and 16% fewer phasing switch errors. The largest gains in panel accuracy were observed on chromosome X and in the regions flanking loci prone to disease-causing CNVs. Moreover, downsampled genomes from the Simons Genome Diversity Project attained higher imputation accuracy when using the T2T-CHM13 versus GRCh38 1kGP panel. Our study demonstrates that use of the T2T-CHM13 phased haplotype panel improves statistical phasing and imputation for samples from diverse human populations.
Iminitoff, M.; Le Fevre, A.; Cameron, T.; Vanyai, H. K.; Chew, C.; Kinkel, S. A.; Breslin, K.; Theiss, S.; Gouil, Q.; Thomas, M.; Murphy, J. M.; Schaaf, C. P.; Keniry, A.; Blewitt, M. E.
Prader-Willi Syndrome (PWS) is a neurodevelopmental disorder caused by lack of gene expression from the active paternal allele at an imprinted gene cluster on chromosome 15. Current treatments have limited efficacy as they target individual symptoms rather than the underlying cause of disease. All patients preserve a normal, yet epigenetically-silenced, copy of the PWS cluster genes; activation of this imprinted copy to restore necessary gene expression is an appealing option for tackling the root of the disorder. Here we have addressed the potential to activate these silent maternal genes by targeting the epigenetic regulator Structural Maintenance of Chromosomes Hinge domain containing 1 (SMCHD1). First, we expanded the role of SMCHD1 in repressing the PWS cluster from mice to humans, a critical step if SMCHD1 is to be a drug target. Second, we discovered that SMCHD1 represses the entire PWS locus in neural lineages, extending its previously known role at only half of the PWS genes. We show that deleting Smchd1 after early development in vivo is effective at causing PWS gene-activation in disease-relevant mouse tissues including hypothalamus, and that this has beneficial effects on phenotypes observed in a PWS mouse model. Despite SMCHD1 having a role in gene silencing elsewhere in the genome, our data suggest that targeting SMCHD1 after early development is remarkably safe. Taken together, these data propose SMCHD1 as a novel target for gene-activation therapy for PWS.
Yun, Y.; Hao, X.; Zhang, Y. D.
Quantifying uncertainty in polygenic score (PGS)-based phenotype prediction is crucial for the integration of genomic data into precision medicine. While the PGS provides a fundamental pivot for point estimation, clinical decision-making necessitates the construction of well-calibrated prediction intervals that reliably encompass the true phenotypic values. However, phenotypic residuals are frequently characterized by complex heteroscedasticity and stratified variance structures across diverse demographic contexts. Existing approaches often rely on global calibration mechanisms, which fail to account for such localized variance structures and lead to systematic miscalibration within specific subpopulations. To bridge this gap, we propose Clustering-based Split Conformal Prediction with Normalized Residuals (C-SCNR), a versatile framework based on Split Conformal Prediction. By adopting residual normalization and incorporating a repetitive "split-and-cluster" mechanism, C-SCNR dynamically identifies latent error strata and applies fine-grained adjustments to the resulting intervals. Our framework requires no distributional assumptions regarding the phenotype, is compatible with any PGS method, and flexibly accommodates biologically-informed grouping. Simulation studies demonstrate that our framework consistently outperforms existing methods across diverse error distributions. In real-data applications analyzing body mass index (BMI), low-density lipoprotein (LDL) cholesterol, and high-density lipoprotein (HDL) cholesterol in the UK Biobank, C-SCNR effectively resolves the coverage deficiencies of existing methods in specific subgroups and consistently yields superior localized calibration. Overall, C-SCNR represents a flexible and powerful framework for constructing high-resolution context-specific prediction intervals, thereby facilitating more reliable clinical interpretations of polygenic risk.
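The building block named in the abstract, split conformal prediction with normalized residuals, can be sketched in a few lines. This is the generic procedure only: it omits the repeated split-and-cluster step that defines C-SCNR, and it assumes that point predictions and a per-individual residual scale estimate are already available from some upstream model. All function and argument names are illustrative assumptions.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile of calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or
    infinity when the calibration set is too small for that rank.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def normalized_split_conformal(
    cal_y, cal_pred, cal_scale, new_pred, new_scale, alpha=0.1
):
    """Prediction interval for one new individual.

    cal_y, cal_pred, cal_scale: observed phenotypes, point predictions,
    and positive residual-scale estimates on a held-out calibration set.
    new_pred, new_scale: the same quantities for the new individual.
    """
    # Nonconformity score: absolute residual divided by its estimated scale.
    scores = [
        abs(y - m) / s for y, m, s in zip(cal_y, cal_pred, cal_scale)
    ]
    q = conformal_quantile(scores, alpha)
    return new_pred - q * new_scale, new_pred + q * new_scale
```

Because the quantile is multiplied by the individual's own scale estimate, intervals widen where residual variance is estimated to be larger, which is the motivation for normalizing before calibrating.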
Guo, B.; Naseri, A.; Xie, Z.; Sarnowski, C.; Zhi, D.; Chen, H.
Although traditional genome-wide association studies (GWAS) have identified numerous loci, they often ignore phased haplotype information. Identity-by-descent (IBD) mapping captures these extended haplotypic effects by modeling shared ancestral segments. However, standard statistical mapping of these segments scales poorly with biobank-sized cohorts and short IBD segments that capture older evolutionary events. To overcome this computational bottleneck, existing scalable IBD mapping frameworks aggregate shared segments into fixed sliding windows. While computationally efficient, this window-based approach generates association signals at a low resolution that often span hundreds of kilobases. To address this issue, here we present a novel High-resolution Fast IBD Mapping test (HiFiMAP) that takes snapshots of IBD segments at the single nucleotide polymorphism (SNP) level resolution. Simulation studies confirm that HiFiMAP maintains well-controlled type I error rates and exhibits superior statistical power for detecting rare variants and haplotype effects using short IBD segments. In a UK Biobank (UKB) benchmark (N=407,681), HiFiMAP mapped 640,899 SNPs at 1.92 CPU seconds per test, massively outperforming existing window-based methods (95.2 CPU seconds per test for 3,403 windows). Furthermore, applied to high-dimensional brain imaging phenotypes (N~36,000), HiFiMAP identified five novel associations previously undetected by standard GWAS approaches, including key central nervous system regulators like NR2F1 and NSF/WNT3. By refining large testing windows into highly specific genomic variants, HiFiMAP empowers biobank-scale, SNP-level resolution mapping to accurately pinpoint complex trait architectures.
Pham, B. K.; Davenport, S.; Azriel, D.; Schwartzman, A.
LD Score Regression (LDSC) is a prominent method, which estimates whole-genome SNP heritability from summary statistics via the slope of a linear regression of GWAS test statistics corresponding to a trait of interest against LD scores. It was claimed by the LDSC authors that the free intercept in the regression accounts for confounding bias such as population stratification. In this study, we argue that the intercept in LDSC must be fixed to 1 for accurate SNP heritability estimation. We show both theoretically and with simulations that the estimated intercept does not accurately capture population stratification effects, and that it adversely affects the accuracy of the heritability estimate introducing bias and increasing variance. Fixing the intercept to 1 eliminates bias and reduces variance when no population stratification is present. On the other hand, under population stratification, LDSC is biased with both the free and the fixed intercept. Additionally, we show that estimated standard errors in LDSC are underestimated, potentially leading to false-positives in downstream GWAS analyses.
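The contrast between a free and a fixed intercept can be made concrete with the usual LDSC relation, E[χ²_j] = 1 + N·a + (N·h²/M)·ℓ_j, where ℓ_j is the LD score of SNP j, N the sample size, M the number of SNPs, and a the confounding term. The sketch below uses plain unweighted least squares for both fits, which is a simplifying assumption: LDSC itself uses a weighted regression with block-jackknife standard errors. Function names are illustrative.

```python
def fit_free_intercept(ld_scores, chi2):
    """Unweighted least squares of chi2 on LD score, intercept estimated."""
    n = len(ld_scores)
    mean_x = sum(ld_scores) / n
    mean_y = sum(chi2) / n
    sxx = sum((x - mean_x) ** 2 for x in ld_scores)
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(ld_scores, chi2)
    )
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_fixed_intercept(ld_scores, chi2, intercept=1.0):
    """Least-squares slope with the intercept held at a fixed value."""
    num = sum(x * (y - intercept) for x, y in zip(ld_scores, chi2))
    den = sum(x * x for x in ld_scores)
    return num / den


def h2_from_slope(slope, n_samples, n_snps):
    """Convert a regression slope to SNP heritability: h2 = slope * M / N."""
    return slope * n_snps / n_samples
```

With noiseless inputs generated from an intercept of 1, both fits recover the same slope; the abstract's argument concerns how the two estimators differ in bias and variance once sampling noise and stratification are present.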
Liang, M.; Wu, R.; Xiao, F.; Li, X.
Mendelian randomization (MR) is widely used to draw causal conclusions in the presence of unmeasured confounding, but most MR analyses focus on average treatment effects and rely on strong assumptions. For precision medicine, the primary target is instead the individualized treatment effect (ITE); yet in MR, such effects are not point-identified under core IV assumptions, and valid inference is particularly challenging. We therefore propose a robust partial identification inference framework for ITE under MR allowing multiple instruments. Under minimal causal assumptions, we derive a sharp inference procedure for the intersection bounds of ITE by adopting a multiplier bootstrap procedure with data-adaptive bootstrap distribution shifting and heterogeneous variance adjustment. In theory, we prove that the proposed method achieves nominal coverage and asymptotic sharpness. Further, we extend the procedure to tolerate possible invalid IVs under a minimal proportion rule assumption by aggregating over instrument subsets while preserving coverage. Simulation studies demonstrate that the proposed methods attain nominal coverage and substantially shorter intervals than existing procedures. We illustrate the framework using data from the Alzheimer's Disease Neuroimaging Initiative to assess heterogeneous causal effects of TREM2 expression on Alzheimer's disease risk across education-defined subgroups.
Preussner, A.; Leinonen, J. T.; FinnGen; Pirinen, M.; Tukiainen, T.
Although the Y chromosome represents roughly 2% of the male genome, it is often ignored in genome-wide association studies (GWAS). Consequently, the potential health impacts of Y-chromosomal genetic variation remain incompletely understood. To fill this gap, we performed a phenome-wide association study (PheWAS) in FinnGen across 1,426 binary and quantitative traits using Y-chromosomal variation (frequency ≥ 1%) in 104,334 genotyped men. As Y chromosome variation is prone to population stratification, we performed carefully adjusted association analyses and further examined these through kin-based validation in 19,275 female and 24,712 male 1st degree relatives. We found 121 suggestive (p < 5.6×10⁻³) phenotypic associations in the Y chromosome, yet none of these were strong enough to reach phenome-wide significance (p < 3.9×10⁻⁶). While only 38 associations were supported in the kin-based validation, intriguingly we found support for a previously suggested link between haplogroup I1 and coronary heart disease (CHD; OR=1.06, 95% CI=1.02-1.11, p=3.7×10⁻³; male validation OR=1.05; female validation OR=0.97). The I1-CHD association was detected across distinct geographical areas within Finland and was independent of loss of Y (LOY) and the autosomal risk of CHD, suggesting a link between germline Y-chromosomal variation and heart disease risk. Overall, this study presents a comprehensive phenome-wide analysis of Y-chromosomal associations, highlighting the potential relevance of Y-chromosomal variation beyond sex determination. Our findings further emphasize the need for improved capture of Y-chromosomal variants and further analyses in biobank-scale data to allow for deeper exploration of male-specific genetic architecture of complex diseases.
Zhang, Y.; Zhang, R.; Ge, T.
Polygenic risk scores (PRSs) hold promise for precision medicine, yet their clinical translation is hindered by substantial uncertainty in individual risk estimates and often limited agreement in risk stratification across multiple PRSs for the same disease. We develop a unified inferential framework to calibrate PRS point estimates and uncertainties for both quantitative traits and binary phenotypes, and to characterize how PRS accuracy, uncertainty, and pairwise correlation jointly determine misclassification and classification inconsistency. We show, both theoretically and empirically, that individual- and population-level misclassification and inconsistency rates are highly predictable in independent datasets. We further evaluate PRS integration and uncertainty-aware probabilistic thresholding strategies that reduce misclassification and improve concordance in risk stratification. Together, these results demonstrate that instability in PRS-based classification is a predictable statistical consequence of uncertainty and establish a principled foundation for incorporating uncertainty into PRS-based risk interpretation, communication, and clinical decision-making.
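One simple reading of uncertainty-aware probabilistic thresholding is sketched below, under the assumption that an individual's true score is approximately normally distributed around the point estimate with a known standard error. The normal assumption, the 0.95 decision level, and the three category labels are illustrative choices, not details taken from the abstract.

```python
import math


def prob_above_threshold(estimate, se, threshold):
    """P(true score > threshold), assuming true score ~ Normal(estimate, se^2)."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    z = (threshold - estimate) / se
    # Upper-tail probability of the standard normal at z.
    return 0.5 * math.erfc(z / math.sqrt(2))


def classify(estimate, se, threshold, min_prob=0.95):
    """Assign a label only when the tail probability is decisive.

    min_prob is a hypothetical decision level; anyone whose probability
    falls between 1 - min_prob and min_prob is left as "uncertain".
    """
    p = prob_above_threshold(estimate, se, threshold)
    if p >= min_prob:
        return "high"
    if p <= 1 - min_prob:
        return "not high"
    return "uncertain"
```

An individual whose estimate sits exactly on the threshold has probability 0.5 of truly exceeding it, so a hard cutoff would classify such a person essentially at random; returning "uncertain" makes that instability explicit.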
Bazemore, K.; Iqbal, T.; Kuzma, A. B.; Grant, S. F. A.; Schellenberg, G. D.; Wang, L.-S.; Chesi, A.; Jin, J.; Naj, A. C.
Pathway-specific polygenic risk scores (pathway-PRS) measure aggregate genetic risk across single nucleotide variants (SNVs) annotated to genes in a pathway of interest. In most applications, SNV-to-gene annotation is based on SNV position with respect to gene boundaries. This approach is ill-suited for incorporating non-coding SNVs, which can regulate gene expression over long distances and represent a large proportion of risk variants for Alzheimer's disease (AD). Here, we compare the performance of AD pathway-PRS across SNV-to-gene annotation strategies that integrate varying levels of functional genomic data, including adult brain chromatin interaction and expression quantitative trait loci (eQTL) data. In the UK Biobank (n=328,526), including AD cases defined by ICD-9/10 codes (n=3,043) and by family history of AD/dementia (n=38,589), we show that the annotation strategy integrating chromatin interaction and eQTL data consistently improves pathway-PRS performance. We replicate this finding in independent data from the Alzheimer's Disease Genetics Consortium (n=3,370). We further find that pathway-PRS associations with AD vary by annotation strategy and that power to detect sex-dependent and age-at-onset associations is increased with integrative annotation. Together, these findings support the use of functionally informed SNV-to-gene annotation for pathway-PRS construction and highlight the importance of applying multiple annotation strategies for robust inference.
Chen, T.; Li, X.; Mazumder, R.; Zhang, H.; Lin, X.
Whole-exome and whole-genome sequencing technology has enabled the discovery of rare genetic variants associated with human health and diseases. However, existing statistical methods used for rare variant association testing are not well-suited for building genetic risk prediction models that jointly incorporate rare and common variants. We propose STELLAR, a flexible ensemble learning-based approach to compute rare variant polygenic risk scores (PRS) using association summary statistics to enhance conventional common variant PRS. Our method combines burden-based and penalty-based rare variant analysis and leverages functional annotation information to prioritize potentially causal variants within the prediction models. In simulation studies, PRS using STELLAR consistently showed the highest prediction accuracy compared to models using common variants alone or rare variant burdens. Applied to UK Biobank whole-exome sequencing data (n=310,831) across eight continuous and five binary traits, STELLAR significantly improved prediction accuracy, refined stratification of individuals at the highest genetic risk beyond common variants, and prioritized biologically relevant genes. STELLAR provides a scalable strategy to incorporate rare variants into PRS in addition to common variants, advancing precision risk prediction and enabling more comprehensive assessment of genetic contributions to complex diseases.